1 Explanation of the dataset

1.1 Motivation

As mentioned in the README, I find employee attrition concerning for several reasons: the future absence of a key player in the company, a non-amortised investment in personnel training, and so on. I therefore searched several dataset repositories for a dataset suited to analysing this phenomenon. I found this dataset on Kaggle, and I believe it fulfils my needs: it addresses exactly this topic and contains both categorical and numerical data to examine.

1.2 Variables

The dataset contains the following variables:

  • Age, the worker’s age.

  • Attrition, the target variable; it indicates whether the worker intends to leave his/her company.

  • BusinessTravel, how often the worker travels for work.

  • DailyRate, the employee’s daily rate.

  • Department, the employee’s current department.

  • DistanceFromHome, the distance (in miles) between the worker’s home and workplace.

  • Education, the employee’s level of studies.

  • EducationField, the employee’s field of study.

  • EmployeeCount, constant 1.

  • EmployeeNumber, primary key of the observation.

  • EnvironmentSatisfaction, ordinal variable from 1 to 4 for the employee’s satisfaction with the work environment.

  • Gender, the employee’s binary gender.

  • HourlyRate, the employee’s hourly rate.

  • JobInvolvement, ordinal variable from 1 to 4 for the employee’s involvement.

  • JobLevel, ordinal variable from 1 to 5 for the employee’s level in the organisation.

  • JobRole, the employee’s role at the company.

  • JobSatisfaction, ordinal variable from 1 to 4 for the employee’s satisfaction with his/her job.

  • MaritalStatus, the employee’s marital status.

  • MonthlyIncome, the employee’s monthly income.

  • MonthlyRate, the employee’s monthly rate.

  • NumCompaniesWorked, how many companies the employee has worked for.

  • Over18, constant 'Y'.

  • OverTime, boolean indicating whether the employee works overtime.

  • PercentSalaryHike, the employee’s percentage salary increase.

  • PerformanceRating, ordinal variable from 1 to 4 with the perceived performance of the employee.

  • RelationshipSatisfaction, ordinal variable from 1 to 4 for the employee’s satisfaction with his/her relationships at work.

  • StandardHours, constant 80.

  • StockOptionLevel, ordinal variable from 0 to 3 indicating how easily the employee can acquire company stock.

  • TotalWorkingYears, total number of years the employee has worked.

  • TrainingTimesLastYear, number of training sessions the employee attended last year.

  • WorkLifeBalance, ordinal variable from 1 to 4 with the employee’s level of work/life balance.

  • YearsAtCompany, number of years the employee has worked at the company.

  • YearsInCurrentRole, number of years the employee has worked in his/her current role.

  • YearsSinceLastPromotion, number of years since the employee’s last promotion.

  • YearsWithCurrManager, number of years the employee has worked with the same manager.

2 Data Loading

# Library loading: install (if missing) and attach each required package.
# C50 is included here because it is used later for the decision tree.
packages <- c('dplyr', 'ggplot2', 'reshape2', 'grid', 'gridExtra',
              'plotly', 'caTools', 'e1071', 'randomForest', 'caret', 'C50')

for (pkg in packages){
  if(!require(pkg, character.only = TRUE)){
    install.packages(pkg, dependencies = c("Depends", "Imports"),
                     repos = 'http://cran.es.r-project.org')
    require(pkg, character.only = TRUE)
  }
}

# Indicates the seed to replicate results
set.seed(1234)

# Dataset loading
dataset <- read.csv('../data/WA_Fn-UseC_-HR-Employee-Attrition.csv')

# Dropping the columns without value for our analysis
col_drops <- c('EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours')

for (column in col_drops){
  dataset[,column] <- NULL
}

# Conversion of strings to Factors in categorical columns
factores <- c("Attrition",
              "BusinessTravel",
              "Department",
              "EducationField",
              "Gender",
              "JobRole",
              "MaritalStatus",
              "OverTime")

for(f in factores){
  dataset[, f] <- as.factor(dataset[, f])
}

3 Exploratory Data Analysis

First of all, we check whether the dataset contains null values:

sprintf("Number of NULL values in the dataset: %i", sum(is.na(dataset)))
## [1] "Number of NULL values in the dataset: 0"
sprintf("Number of ' ' values in the dataset: %i", sum(dataset == ' '))
## [1] "Number of ' ' values in the dataset: 0"
sprintf("Number of '?' values in the dataset: %i", sum(dataset == '?'))
## [1] "Number of '?' values in the dataset: 0"
sprintf("Number of '' values in the dataset: %i", sum(dataset == ''))
## [1] "Number of '' values in the dataset: 0"

As we can see, no null or placeholder values are detected in this dataset, which suggests that data completeness is high or total, provided no further problems surface during the analysis.

In order to have a solid ground for our analysis, we describe the values and data types of the dataset’s columns:

str(dataset)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
summary(dataset)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EnvironmentSatisfaction    Gender      HourlyRate    
##  Human Resources : 27    Min.   :1.000           Female:588   Min.   : 30.00  
##  Life Sciences   :606    1st Qu.:2.000           Male  :882   1st Qu.: 48.00  
##  Marketing       :159    Median :3.000                        Median : 66.00  
##  Medical         :464    Mean   :2.722                        Mean   : 65.89  
##  Other           : 82    3rd Qu.:4.000                        3rd Qu.: 83.75  
##  Technical Degree:132    Max.   :4.000                        Max.   :100.00  
##                                                                               
##  JobInvolvement    JobLevel                          JobRole    JobSatisfaction
##  Min.   :1.00   Min.   :1.000   Sales Executive          :326   Min.   :1.000  
##  1st Qu.:2.00   1st Qu.:1.000   Research Scientist       :292   1st Qu.:2.000  
##  Median :3.00   Median :2.000   Laboratory Technician    :259   Median :3.000  
##  Mean   :2.73   Mean   :2.064   Manufacturing Director   :145   Mean   :2.729  
##  3rd Qu.:3.00   3rd Qu.:3.000   Healthcare Representative:131   3rd Qu.:4.000  
##  Max.   :4.00   Max.   :5.000   Manager                  :102   Max.   :4.000  
##                                 (Other)                  :215                  
##   MaritalStatus MonthlyIncome    MonthlyRate    NumCompaniesWorked OverTime  
##  Divorced:327   Min.   : 1009   Min.   : 2094   Min.   :0.000      No :1054  
##  Married :673   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000      Yes: 416  
##  Single  :470   Median : 4919   Median :14236   Median :2.000                
##                 Mean   : 6503   Mean   :14313   Mean   :2.693                
##                 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000                
##                 Max.   :19999   Max.   :26999   Max.   :9.000                
##                                                                              
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StockOptionLevel
##  Min.   :11.00     Min.   :3.000     Min.   :1.000            Min.   :0.0000  
##  1st Qu.:12.00     1st Qu.:3.000     1st Qu.:2.000            1st Qu.:0.0000  
##  Median :14.00     Median :3.000     Median :3.000            Median :1.0000  
##  Mean   :15.21     Mean   :3.154     Mean   :2.712            Mean   :0.7939  
##  3rd Qu.:18.00     3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:1.0000  
##  Max.   :25.00     Max.   :4.000     Max.   :4.000            Max.   :3.0000  
##                                                                               
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##                                                                          
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000      
## 

We can combine the summary table with histograms of the continuous variables, in order to detect outliers more easily and perceive the distribution of the different characteristics.

melt.dataset <- melt(dataset)

ggplot(data = melt.dataset, aes(x = value)) + 
  stat_density() + 
  facet_wrap(~variable, scales = "free")

We observe that most variables are concentrated towards the left of their histograms (i.e. right-skewed), which makes sense, as most workers receive the lowest benefits within organizations. We can also clearly see that most workers live fairly close to their workplace, for convenience. On the other hand, some variables are roughly uniform (such as the Rate variables) and some follow a normal distribution (such as Age, which is normal across a wide range of datasets).
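To back the visual impression with a number, the skewness of each variable can be computed with e1071 (loaded earlier). A minimal sketch on made-up toy vectors (the values are illustrative, not taken from the dataset):

```r
library(e1071)

# Toy, bottom-heavy "salary-like" values: most observations at the low end
income_like <- c(rep(2000, 6), 3000, 4000, 9000, 20000)
# Toy, roughly symmetric "age-like" values
age_like <- c(25, 30, 33, 36, 36, 37, 40, 43, 48, 55)

skewness(income_like)  # clearly positive: long right tail
skewness(age_like)     # close to zero: approximately symmetric
```

Applying `skewness()` to columns of `dataset` quantifies the right tails we see in the income-related histograms.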

3.1 Categorical Variables

With the summary of the dataset in mind, we can now check the distribution of our target variable across different factors. I selected three of them as interesting due to their importance in our everyday lives:

  1. Gender
  2. OverTime
  3. Marital Status
grid.newpage()

plotbyGender <- ggplot(dataset, aes(Gender, fill = Attrition)) +
  geom_bar() +
  labs(x = "Gender", y = "Employees") +
  guides(fill = guide_legend(title = "")) +
  scale_fill_manual(values = c("black", "#008000")) +
  ggtitle("By Gender")

plotbyOverTime <- ggplot(dataset, aes(OverTime, fill = Attrition)) +
  geom_bar() +
  labs(x = "OverTime", y = "Employees") +
  guides(fill = guide_legend(title = "")) +
  scale_fill_manual(values = c("black", "#008000")) +
  ggtitle("By OverTime")

plotbyMaritalStatus <- ggplot(dataset, aes(MaritalStatus, fill = Attrition)) +
  geom_bar() +
  labs(x = "MaritalStatus", y = "Employees") +
  guides(fill = guide_legend(title = "")) +
  scale_fill_manual(values = c("black", "#008000")) +
  ggtitle("By MaritalStatus") +
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(plotbyGender,plotbyOverTime,plotbyMaritalStatus,ncol=3)
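The stacked bars compare absolute counts; the attrition rate within each category can also be read off directly with `prop.table`. A minimal sketch on a made-up toy frame (the column names mirror the dataset's, the values are invented):

```r
# Toy data imitating the OverTime / Attrition columns
toy <- data.frame(
  OverTime  = c('Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No'),
  Attrition = c('Yes', 'Yes', 'No',  'No', 'No', 'No', 'Yes', 'No')
)

# margin = 1 normalises each row, i.e. attrition rate per OverTime value
prop.table(table(toy$OverTime, toy$Attrition), margin = 1)
```

Applied to `dataset`, the same call quantifies what the bar charts suggest.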

More generally, we will also check the columns with a larger number of factor levels, to see whether they are relevant enough to be considered in our modelling. The columns in this case are EducationField and JobRole.

3.1.1 Education Field

qplot <- ggplot(dataset, aes(fill=EducationField, x=Attrition)) + 
  geom_bar(position="fill") + 
  ggtitle("EducationField by Attrition in %")
ggplotly(qplot)

We see differences in proportion for Marketing and Technical Degree in EducationField, as these fields are over-represented among employees with attrition; Medical, on the contrary, is more represented among employees not considering a career change.

3.1.2 Job Role

qplot <- ggplot(dataset, aes(fill=JobRole, x=Attrition)) + 
  geom_bar(position="fill") + 
  ggtitle("JobRole by Attrition in %")
ggplotly(qplot)

Regarding JobRole, Sales Representative and Laboratory Technician employees tend to be more prone to a change of job post; on the contrary, Manager and Manufacturing Director are more representative of the negative value of the target variable.

3.2 Continuous Variables

Now we do the same with continuous variables. I have chosen Age as the dimension to pivot by, in order to get a clear picture of attrition across this variable.

3.2.1 Age

qplot <- ggplot(dataset, aes(x = Age)) +
            geom_density(aes(y = ..density..,
                             fill = Attrition),
                         alpha = 0.5) + 
            ggtitle("Age Density per Attrition")
ggplotly(qplot)

We observe that employees inclined towards attrition have a distribution concentrated at younger ages, with its mode at 30-31 years and much lower density above 40. Employees not thinking about changing jobs tend to be older, with a much more spread-out age distribution, close to normal (though with a wider tail at older ages).

3.2.2 MonthlyIncome

qplot <- ggplot(dataset, aes(x = MonthlyIncome)) +
            geom_density(aes(y = ..density..,
                             fill = Attrition),
                         alpha = 0.5) + 
            ggtitle("Monthly Income Density per Attrition")
ggplotly(qplot)

On the other hand, we can check the density of MonthlyIncome differentiated by Attrition too. The salary density of workers considering quitting is heavily skewed towards lower salaries, with a very noticeable peak around $2,500 per month. Although employees without attrition also have salaries concentrated around $2,500-6,000, the concentration is far less pronounced and corresponds to the natural income distribution in developed countries.

4 Data Preparation

We need to prepare our data specifically to perform our modelling as efficiently as possible:

# Copying our dataset for changes to modelling
dataset_model <- dataset

# Converting binary columns to INT
dataset_model$Gender <- ifelse(dataset_model$Gender == 'Male', 1, 0)
dataset_model$OverTime <- ifelse(dataset_model$OverTime == 'Yes', 1, 0)

# Creating variables on EducationField and JobRole leaving the relevant values as booleans
dataset_model$MarketingTechnical <- ifelse(dataset_model$EducationField 
                                           %in% c('Marketing', 'Technical Degree'), 1, 0)
dataset_model$Medical <- ifelse(dataset_model$EducationField 
                                %in% c('Medical'), 1, 0)
dataset_model$LaboratorySales <- ifelse(dataset$JobRole 
                                        %in% c('Laboratory Technician','Sales Representative'), 1, 0)
dataset_model$Director <- ifelse(dataset$JobRole
                                 %in% c('Manager', 'Manufacturing Director'), 1, 0)


# Encoding rest of categorical columns
dv <- caret::dummyVars(" ~ BusinessTravel + Department + MaritalStatus", data = dataset_model)
new_columns <- data.frame(predict(dv, newdata = dataset_model))
dataset_model <- cbind(dataset_model, new_columns)

# Dropping columns with categorical values
dataset_model <- dataset_model[,c(-3, -5, -8,-14, -16, -38, -41, -44)]

# Scaling numerical columns
numerical_columns <- c('Age',
                       'DailyRate',
                       'DistanceFromHome',
                       'HourlyRate',
                       'MonthlyIncome',
                       'MonthlyRate',
                       'PercentSalaryHike',
                       'TotalWorkingYears',
                       'YearsAtCompany',
                       'YearsInCurrentRole',
                       'YearsWithCurrManager')

for (col in numerical_columns){
  dataset_model[, col] <- scale(dataset_model[, col])
}
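As a quick sanity check, `scale()` centres a column to mean 0 and unit standard deviation; a toy vector shows the effect:

```r
v <- c(10, 20, 30, 40)
sv <- scale(v)   # returns a 1-column matrix of z-scores
mean(sv)         # effectively 0
sd(sv)           # 1
```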

# Splitting into training and test
sample = sample.split(dataset_model$Attrition, SplitRatio = 2/3)
train = subset(dataset_model, sample == TRUE)
test  = subset(dataset_model, sample == FALSE)
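`caTools::sample.split` stratifies on the target, so the Attrition class balance is preserved in both partitions. A self-contained sketch on a made-up imbalanced factor (the 90/10 split loosely mimics the dataset's No/Yes imbalance):

```r
library(caTools)
set.seed(1234)

# Toy target: 90 'No' vs 10 'Yes'
y <- factor(rep(c('No', 'Yes'), times = c(90, 10)))
s <- sample.split(y, SplitRatio = 2/3)

table(y, s)   # each class is split roughly 2:1 between TRUE and FALSE
mean(s)       # close to 2/3 overall
```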

5 Modeling

Since we are predicting a categorical variable with a binary outcome, we will focus our efforts on models specialised in this kind of task. From simpler to more complex, we will fit a total of 4 different models:

5.1 Logistic Regression

# Making of the model
log_model <- glm(formula = Attrition~.,
                 family = 'binomial',
                 data = train)

# Predicting the Test set results
prob_pred = predict(log_model, type = 'response', newdata = test[-2])
log_pred = ifelse(prob_pred > 0.35, 1, 0)

# Getting the main indicators of our model
summary(log_model)
## 
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6640  -0.4735  -0.2354  -0.0844   3.3747  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                        4.651810   2.010783   2.313 0.020699 *  
## Age                               -0.186526   0.150773  -1.237 0.216039    
## DailyRate                         -0.047300   0.111864  -0.423 0.672417    
## DistanceFromHome                   0.339339   0.112067   3.028 0.002462 ** 
## Education                         -0.092350   0.110806  -0.833 0.404597    
## EnvironmentSatisfaction           -0.566843   0.106835  -5.306 1.12e-07 ***
## Gender                             0.340214   0.233158   1.459 0.144521    
## HourlyRate                         0.042092   0.112465   0.374 0.708202    
## JobInvolvement                    -0.620997   0.155692  -3.989 6.65e-05 ***
## JobLevel                          -0.211723   0.389494  -0.544 0.586726    
## JobSatisfaction                   -0.406853   0.100989  -4.029 5.61e-05 ***
## MonthlyIncome                     -0.329159   0.436445  -0.754 0.450740    
## MonthlyRate                       -0.053648   0.112096  -0.479 0.632231    
## NumCompaniesWorked                 0.208151   0.048836   4.262 2.02e-05 ***
## OverTime                           2.046247   0.246004   8.318  < 2e-16 ***
## PercentSalaryHike                  0.008585   0.171671   0.050 0.960118    
## PerformanceRating                 -0.198380   0.497478  -0.399 0.690061    
## RelationshipSatisfaction          -0.320773   0.103693  -3.093 0.001978 ** 
## StockOptionLevel                  -0.071710   0.192633  -0.372 0.709698    
## TotalWorkingYears                 -0.469255   0.271122  -1.731 0.083489 .  
## TrainingTimesLastYear             -0.216489   0.091676  -2.361 0.018203 *  
## WorkLifeBalance                   -0.485512   0.160842  -3.019 0.002540 ** 
## YearsAtCompany                     0.427890   0.299202   1.430 0.152686    
## YearsInCurrentRole                -0.561984   0.207100  -2.714 0.006656 ** 
## YearsSinceLastPromotion            0.213513   0.056319   3.791 0.000150 ***
## YearsWithCurrManager              -0.397818   0.220935  -1.801 0.071764 .  
## MarketingTechnical                 0.743286   0.292840   2.538 0.011142 *  
## Medical                           -0.230989   0.270058  -0.855 0.392369    
## LaboratorySales                    0.920166   0.268413   3.428 0.000608 ***
## Director                           0.322959   0.458868   0.704 0.481547    
## BusinessTravel.Non.Travel         -1.058429   0.467620  -2.263 0.023609 *  
## BusinessTravel.Travel_Frequently   1.185401   0.262721   4.512 6.42e-06 ***
## Department.Human.Resources         0.425736   0.568342   0.749 0.453806    
## Department.Research...Development -0.568613   0.275673  -2.063 0.039147 *  
## MaritalStatus.Divorced            -1.251393   0.435912  -2.871 0.004095 ** 
## MaritalStatus.Married             -1.138989   0.317375  -3.589 0.000332 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 865.72  on 979  degrees of freedom
## Residual deviance: 551.66  on 944  degrees of freedom
## AIC: 623.66
## 
## Number of Fisher Scoring iterations: 6
# Making the Confusion Matrix
log_cm = table(test[, 2], log_pred == 1)
log_cm
##      
##       FALSE TRUE
##   No    379   32
##   Yes    36   43
# Calculating the accuracy, the precision and the recall
log_acc = (log_cm[1,1] + log_cm[2,2]) / sum(log_cm)
log_pre = log_cm[2,2] / (log_cm[2,2] + log_cm[1,2])
log_rec = log_cm[2,2] / (log_cm[2,2] + log_cm[2,1])

log_stats = cbind(accuracy = log_acc,
                  precision = log_pre,
                  recall = log_rec)
log_stats
##       accuracy precision    recall
## [1,] 0.8612245 0.5733333 0.5443038
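As a cross-check, the three ratios can be recomputed by hand from the confusion-matrix entries, keeping in mind which cell is which: with rows as actual and columns as predicted, (TN + TP) / total is the accuracy, TP / (TP + FP) the precision, and TP / (TP + FN) the recall. A sketch using the matrix printed above:

```r
# Same layout as the table() output above: rows = actual, cols = predicted
cm <- matrix(c(379, 36, 32, 43), nrow = 2,
             dimnames = list(actual = c('No', 'Yes'),
                             predicted = c('No', 'Yes')))

accuracy  <- (cm[1, 1] + cm[2, 2]) / sum(cm)   # (TN + TP) / total
precision <- cm[2, 2] / (cm[2, 2] + cm[1, 2])  # TP / (TP + FP)
recall    <- cm[2, 2] / (cm[2, 2] + cm[2, 1])  # TP / (TP + FN)

round(c(accuracy = accuracy, precision = precision, recall = recall), 4)
```

The results match the logistic-regression figures above.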

5.2 Decision Trees

# Making of the model
tree_model <- C50::C5.0(train[,-2], train[,2], rules=FALSE)

# Showing model stats
summary(tree_model)
## 
## Call:
## C5.0.default(x = train[, -2], y = train[, 2], rules = FALSE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sat Feb 27 01:02:11 2021
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 980 cases (36 attributes) from undefined.data
## 
## Decision tree:
## 
## TotalWorkingYears <= -1.321152:
## :...BusinessTravel.Travel_Frequently > 0: Yes (10/1)
## :   BusinessTravel.Travel_Frequently <= 0:
## :   :...TrainingTimesLastYear <= 1: Yes (4)
## :       TrainingTimesLastYear > 1:
## :       :...MaritalStatus.Divorced > 0: No (10)
## :           MaritalStatus.Divorced <= 0:
## :           :...MarketingTechnical > 0: Yes (8/1)
## :               MarketingTechnical <= 0:
## :               :...StockOptionLevel > 0: No (5)
## :                   StockOptionLevel <= 0:
## :                   :...OverTime > 0: Yes (5)
## :                       OverTime <= 0:
## :                       :...WorkLifeBalance > 2: No (13/1)
## :                           WorkLifeBalance <= 2:
## :                           :...WorkLifeBalance <= 1: No (2)
## :                               WorkLifeBalance > 1: Yes (4)
## TotalWorkingYears > -1.321152:
## :...OverTime <= 0:
##     :...WorkLifeBalance <= 1:
##     :   :...MaritalStatus.Married > 0: No (17/1)
##     :   :   MaritalStatus.Married <= 0:
##     :   :   :...Education > 3: Yes (4)
##     :   :       Education <= 3:
##     :   :       :...EnvironmentSatisfaction <= 2: Yes (4/1)
##     :   :           EnvironmentSatisfaction > 2: No (6)
##     :   WorkLifeBalance > 1:
##     :   :...BusinessTravel.Travel_Frequently <= 0: No (513/32)
##     :       BusinessTravel.Travel_Frequently > 0:
##     :       :...DailyRate > -1.13129: No (100/11)
##     :           DailyRate <= -1.13129:
##     :           :...PerformanceRating > 3: No (5)
##     :               PerformanceRating <= 3:
##     :               :...Director <= 0: Yes (9/1)
##     :                   Director > 0: No (2)
##     OverTime > 0:
##     :...MonthlyIncome > -0.6367797:
##         :...EnvironmentSatisfaction > 2: No (129/12)
##         :   EnvironmentSatisfaction <= 2:
##         :   :...NumCompaniesWorked > 8: Yes (4)
##         :       NumCompaniesWorked <= 8:
##         :       :...JobSatisfaction > 3: No (24/2)
##         :           JobSatisfaction <= 3:
##         :           :...JobInvolvement <= 1: Yes (4)
##         :               JobInvolvement > 1:
##         :               :...MarketingTechnical > 0: Yes (4/1)
##         :                   MarketingTechnical <= 0:
##         :                   :...DistanceFromHome <= 1.45648:
##         :                       :...MonthlyIncome <= -0.495317: Yes (2)
##         :                       :   MonthlyIncome > -0.495317: No (20)
##         :                       DistanceFromHome > 1.45648:
##         :                       :...MonthlyRate <= 0.6860134: Yes (6)
##         :                           MonthlyRate > 0.6860134: No (2)
##         MonthlyIncome <= -0.6367797:
##         :...BusinessTravel.Travel_Frequently > 0: Yes (17/3)
##             BusinessTravel.Travel_Frequently <= 0:
##             :...JobInvolvement > 3: No (5)
##                 JobInvolvement <= 3:
##                 :...JobInvolvement <= 1: Yes (4)
##                     JobInvolvement > 1:
##                     :...YearsInCurrentRole <= -0.8912861: Yes (5)
##                         YearsInCurrentRole > -0.8912861:
##                         :...Department.Human.Resources > 0: No (2)
##                             Department.Human.Resources <= 0:
##                             :...MaritalStatus.Married > 0: No (10/2)
##                                 MaritalStatus.Married <= 0:
##                                 :...YearsSinceLastPromotion > 4: Yes (2)
##                                     YearsSinceLastPromotion <= 4:
##                                     :...RelationshipSatisfaction > 3: No (7/1)
##                                         RelationshipSatisfaction <= 3: [S1]
## 
## SubTree [S1]
## 
## Department.Research...Development <= 0: Yes (2)
## Department.Research...Development > 0:
## :...DistanceFromHome <= -0.8872132: No (3)
##     DistanceFromHome > -0.8872132: Yes (7/1)
## 
## 
## Evaluation on training data (980 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      38   71( 7.2%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     813     9    (a): class No
##      62    96    (b): class Yes
## 
## 
##  Attribute usage:
## 
##  100.00% TotalWorkingYears
##   96.22% OverTime
##   76.94% BusinessTravel.Travel_Frequently
##   69.29% WorkLifeBalance
##   26.43% MonthlyIncome
##   20.92% EnvironmentSatisfaction
##   11.84% DailyRate
##    8.67% JobInvolvement
##    7.24% MarketingTechnical
##    6.73% NumCompaniesWorked
##    6.33% JobSatisfaction
##    6.33% MaritalStatus.Married
##    5.20% TrainingTimesLastYear
##    4.80% MaritalStatus.Divorced
##    4.08% DistanceFromHome
##    3.88% YearsInCurrentRole
##    3.37% Department.Human.Resources
##    2.96% StockOptionLevel
##    2.14% YearsSinceLastPromotion
##    1.94% RelationshipSatisfaction
##    1.63% PerformanceRating
##    1.43% Education
##    1.22% Department.Research...Development
##    1.12% Director
##    0.82% MonthlyRate
## 
## 
## Time: 0.0 secs
# Predicting test data
tree_pred <- predict(tree_model, newdata = test)

# Making the Confusion Matrix
tree_cm = table(test[, 2], tree_pred)
tree_cm
##      tree_pred
##        No Yes
##   No  375  36
##   Yes  55  24
# Calculating the accuracy, the precision and the recall
tree_acc = (tree_cm[1,1] + tree_cm[2,2]) / sum(tree_cm)
tree_pre = tree_cm[2,2] / (tree_cm[2,2] + tree_cm[1,2])
tree_rec = tree_cm[2,2] / (tree_cm[2,2] + tree_cm[2,1])

tree_stats = cbind(accuracy = tree_acc,
                   precision = tree_pre,
                   recall = tree_rec)
tree_stats
##       accuracy precision    recall
## [1,] 0.8142857       0.4 0.3037975

5.3 SVM Classifiers

# Making of the model
svm_model = svm(formula = Attrition ~ .,
                data = train,
                type = 'C-classification',
                kernel = 'radial')

# Predicting the Test set results
svm_pred = predict(svm_model, newdata = test[-2])

# Making the Confusion Matrix
svm_cm = table(test[, 2], svm_pred)
svm_cm
##      svm_pred
##        No Yes
##   No  405   6
##   Yes  65  14
# Calculating the accuracy, the precision and the recall
svm_acc = (svm_cm[1,1] + svm_cm[2,2]) / sum(svm_cm)
svm_pre = svm_cm[2,2] / (svm_cm[2,2] + svm_cm[1,2])
svm_rec = svm_cm[2,2] / (svm_cm[2,2] + svm_cm[2,1])

svm_stats = cbind(accuracy = svm_acc,
                  precision = svm_pre,
                  recall = svm_rec)
svm_stats
##      accuracy precision    recall
## [1,] 0.855102       0.7 0.1772152

5.4 Random Forests

# Making of the model
rfor_model = randomForest(x = train[-2],
                          y = train$Attrition,
                          ntree = 500)

# Predicting the Test set results
rfor_pred = predict(rfor_model, newdata = test[-2])

# Making the Confusion Matrix
rfor_cm = table(test[, 2], rfor_pred)
rfor_cm
##      rfor_pred
##        No Yes
##   No  409   2
##   Yes  68  11
# Calculating the accuracy, the precision and the recall
rfor_acc = (rfor_cm[1,1] + rfor_cm[2,2]) / sum(rfor_cm)
rfor_pre = rfor_cm[2,2] / (rfor_cm[2,2] + rfor_cm[1,2])
rfor_rec = rfor_cm[2,2] / (rfor_cm[2,2] + rfor_cm[2,1])

rfor_stats = cbind(accuracy = rfor_acc,
                   precision = rfor_pre,
                   recall = rfor_rec)
rfor_stats
##       accuracy precision    recall
## [1,] 0.8571429 0.8461538 0.1392405

6 Conclusions

With our modelling complete, we can put the results in perspective. In a company environment, losing talent can be a step into the abyss: we may lose key players, and with them our competitive advantage, and we may even end up handing them to our competitors, which could be catastrophic for our survival in the market. Besides, policies to reduce attrition are not as expensive as other kinds of corporate investment, and they substitute money that would otherwise be spent on recruiters.

With these two ideas in mind, it is clear that we should favour a model that, given a good overall accuracy, also maintains an acceptable recall, as False Negatives are more dangerous than False Positives in this case (the money spent retaining an employee who was already staying is not really lost, and is tiny compared with the money lost to attrition). To decide on our model, let’s remember our scores:

# Combining them into a single data.frame
scores <- as.data.frame(rbind(log_stats, 
                        tree_stats,
                        svm_stats, 
                        rfor_stats))
scores$model <- c("Logistic Regression",
                  "Decision Trees",
                  "SVM",
                  "Random Forest")
scores <- scores[,c(4,1,2,3)]
scores
##                 model  accuracy precision    recall
## 1 Logistic Regression 0.8612245 0.5733333 0.5443038
## 2      Decision Trees 0.8142857 0.4000000 0.3037975
## 3                 SVM 0.8551020 0.7000000 0.1772152
## 4       Random Forest 0.8571429 0.8461538 0.1392405
# Plotting the results
melted <- melt(scores[, c('model', 'accuracy', 'precision', 'recall')])
q <- ggplot(data = melted, aes(x = model, y = value)) +
  geom_bar(aes(fill = variable), stat = 'identity', position = 'dodge') +
  xlab('Model') + 
  ylab('Metric') +
  ggtitle('Classification metrics per model')
ggplotly(q)
q <- ggplot(data = melted, aes(x = variable, y = value)) +
  geom_boxplot() +
  geom_jitter(aes(color = model)) + 
  xlab('Metric') +
  ylab('Value') +
  ggtitle('Distribution of Metric per model')
ggplotly(q)
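The 0.35 probability cut-off used for the logistic model reflects exactly this preference: lowering the threshold below the default 0.5 flags more borderline employees as at-risk, trading some precision for recall. A toy sketch of the effect (the probabilities are invented):

```r
probs <- c(0.10, 0.30, 0.40, 0.55, 0.80)

ifelse(probs > 0.50, 1, 0)  # default cut-off: 2 employees flagged
ifelse(probs > 0.35, 1, 0)  # lower cut-off: 3 employees flagged
```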

So our Logistic Regression, even though it is the simplest model considered, also gives us the best accuracy and by far the best recall, the metric we most need to maximise. For that reason, we should use this model to predict new cases of Attrition in our hypothetical company.